policy evaluation
Multi-Objective Reinforcement Learning with Max-Min Criterion: AGame-Theoretic Approach
In this paper, we propose a provably convergent and practical framework for multi-objective reinforcement learning with max-min criterion. From a game-theoretic perspective, we reformulate max-min multi-objective reinforcement learning as a two-player zero-sum regularized continuous game and introduce an efficient algorithm based on mirror descent.
Causal Explanation-Guided Learning for Organ Allocation
A central challenge in organ transplantation is the extremely low acceptance rate of donor organ offers--typically in the single digits--leading to high discard rates and suboptimal use of available grafts. Current acceptance models embedded in allocation systems are non-causal, trained on observational data, and fail to generalize to policy-relevant counterfactuals. This limits their reliability for both policy evaluation and simulator-based optimization. In this work, we reframe organ offer acceptance as a counterfactual prediction problem and propose a method to learn from routinely recorded--but often overlooked--refusal explanations. These refusal reasons act as direction-only counterfactual signals: for example, a refusal reason such as old donor age implies acceptance might have occurred had the donor been younger. We formalize this setting and introduce ClexNet, a novel causal model that learns policy-invariant representations via balanced training and an explanation-guided augmentation loss. On both synthetic and semi-synthetic data, ClexNet outperforms existing acceptance models in predictive performance, generalization, and calibration, offering a robust drop-in improvement for simulators and allocation policy evaluation. Beyond transplantation, our approach provides a general method for incorporating human direction-only explanations as a form of model supervision, improving performance in settings where only observational data is available.
Safe and Efficient Off-Policy Reinforcement Learning
Remi Munos, Tom Stepleton, Anna Harutyunyan, Marc Bellemare
In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(ฮป), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) it is efficient as it makes the best use of samples collected from near on-policy behaviour policies. We analyze the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. We believe this is the first return-based off-policy control algorithm converging a.s. to Q without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins' Q(ฮป), which was an open problem since 1989. We illustrate the benefits of Retrace(ฮป) on a standard suite of Atari 2600 games. One fundamental trade-off in reinforcement learning lies in the definition of the update target: should one estimate Monte Carlo returns or bootstrap from an existing Q-function?
OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators
Offline policy evaluation (OPE) allows us to evaluate and estimate a new sequential decision-making policy's performance by leveraging historical interaction data collected from other policies. Evaluating a new policy online without a confident estimate of its performance can lead to costly, unsafe, or hazardous outcomes, especially in education and healthcare. Several OPE estimators have been proposed in the last decade, many of which have hyperparameters and require training. Unfortunately, choosing the best OPE algorithm for each task and domain is still unclear. In this paper, we propose a new algorithm that adaptively blends a set of OPE estimators given a dataset without relying on an explicit selection using a statistical procedure. We prove that our estimator is consistent and satisfies several desirable properties for policy evaluation. Additionally, we demonstrate that when compared to alternative approaches, our estimator can be used to select higher-performing policies in healthcare and robotics. Our work contributes to improving ease of use for a general-purpose, estimator-agnostic, off-policy evaluation framework for offline RL.